RNA Aligner

An example of how a RNA nucleotide sequence might align with a genome

Sequence alignment is an important problem in bioinformatics: simply put, given a genetic sequence, in this case RNA, where is its best match in a larger sequence, such as an entire genome? Finding an exact match can be done in linear time , and even inexact matches can be found in linear time if you bound the number of mismatches. However, the task is not always this simple: nucleotide substitution may have taken place anywhere in either sequence, and long stretches of DNA (known as introns) may have been inserted in the middle of the genomic sequence. Perhaps the biggest issue is that the genome is very large - a human genome is by itself ~3 GB, so even a polynomial space/time algorithm on the size of the genome may be unrealistic.

The approach this algorithm takes is similar to that of Bowtie2 with some simplifications and modifications for the problem at hand. It is able to find an alignment in under 0.5 seconds, and using FM-index compression, space requirements are low - the genome sequence is only needed during initialization and can be discarded when finding an alignment.

The (simplified) steps of the algorithm are as follows:

I mostly wrote this to explore the mechanics behind the operation of various sequence alignment algorithms, and to see how the implementation might vary if you’re working under a different set of assumptions. It is not necessarily intended to be used for real-world applications as is, but with some optimizations it could form the basis for a usable tool.